Motivation

Provide an overview of the project goals and motivation

  • The motivation for a project exploring the correlation between low birth weight, premature birth rate, cancer mortality rate, cardiovascular disease hospitalization rate, asthma hospitalization rate and PM2.5 concentration, as well as socioeconomic, racial, education, and gender factors across the 62 counties of New York State, stems from a compelling public health imperative.

  • All the outcomes of interest are critical indicators of population health. By examining the impact of fine particulate pollution, this project aims to uncover environmental health disparities that disproportionately affect under-served communities.

  • It seeks to inform public health policy by highlighting the need for targeted interventions that could mitigate the adverse effects of air pollution on vulnerable populations, including pregnant women and newborns. The analysis of socioeconomic and racial factors will offer insights into the complex interplay between environment and social determinants of health, potentially guiding urban planning and healthcare resource allocation.

  • Furthermore, by considering gender-specific vulnerabilities,the project aims to provide a comprehensive overview of the risk landscape, fostering community awareness and prompting action to improve air quality and health equity in one of the most densely populated urban areas in the world.

Initial Question

We aim to assess the impact of PM2.5 concentration on various health outcomes across New York state, encompassing neonatal health to the well-being of mature individuals. Our investigation involves not only preliminary data analysis but also the implementation of regression models, accounting for demographic characteristics. Through this comprehensive approach, we seek to understand the individual contributions of different variables to health outcomes. Ultimately, our goal is to identify the most effective model for predicting these health outcomes, with the potential to provide valuable insights for public health benefits.

Data

Source, scraping method, cleaning, etc.

Data Import

Click show on the right to view the code chunk for data importing.

lowbirthweight <- read_csv("csv_NYC_lowbirthweight.csv")
pm2_5 <- read_csv("pm2.5.csv")
edu_NY <- read_excel("Edu_NY.xlsx")
race_NY <- read_excel("Race_NY.xlsx")
HHincome_NY <- read_excel("HHincome_NY.xlsx")
Age_NY <- read_excel("Age_NY.xlsx")
Sex_NY <- read_excel("Sex_NY.xlsx")
health <- read_excel("chir_current_data.xlsx")
uscounties <- read_csv("uscounties.csv") #Simplemaps.com

9 data files are obtained from publicly available sources online. These will be imported, wrangled properly, and merged to create a complete data for further analysis.

Data Wrangling

Click show on the right to view the code chunk for data importing.

PM2.5 Dataset

pm2_5 <- pm2_5 %>% 
  janitor::clean_names()%>%
  select (county,value) %>%
  rename (annual_pm2.5 = "value")

pm2_5 %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county annual_pm2.5
Albany 7.9
Allegany 6.5
Bronx 8.3
Broome 6.9
Cattaraugus 7.3
Cayuga 6.4
Chautauqua 6.6
Chemung 6.2
Chenango 6.5
Clinton 7.0
Columbia 7.0
Cortland 6.1
Delaware 6.6
Dutchess 7.7
Erie 7.6
Essex 3.4
Franklin 7.4
Fulton 6.7
Genesee 7.3
Greene 6.8
Hamilton 6.0
Herkimer 6.5
Jefferson 6.8
Kings 7.9
Lewis 6.1
Livingston 6.7
Madison 6.1
Monroe 7.4
Montgomery 6.9
Nassau 8.7
New York 10.4
Niagara 8.7
Oneida 6.5
Onondaga 5.1
Ontario 6.6
Orange 6.4
Orleans 7.3
Oswego 6.4
Otsego 6.4
Putnam 7.9
Queens 8.3
Rensselaer 7.5
Richmond 7.8
Rockland 8.8
Saratoga 7.7
Schenectady 7.5
Schoharie 6.5
Schuyler 6.0
Seneca 6.4
St. Lawrence 7.7
Steuben 4.8
Suffolk 6.8
Sullivan 7.2
Tioga 6.6
Tompkins 6.2
Ulster 7.4
Warren 7.2
Washington 7.3
Wayne 6.7
Westchester 9.3
Wyoming 7.0
Yates 6.0

The dataset is obtained from EPH Tracking Website from CDC (https://ephtracking.cdc.gov/DataExplorer/). This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- annual_pm2.5: annual estimated PM2.5 concentration at each county (ug/m^3)

Demographic Dataset

The following set of demographic datasets is obtained from Census Reporter webpage (https://censusreporter.org/).

Education Attainment

edu <- edu_NY %>% 
  janitor::clean_names()%>%
  mutate (percentage_high_education = (bach_male+master_male+prof_male+doct_male+bach_female+master_female+prof_female+doct_female)/total) %>%
  filter(str_detect (name, " County, NY")) %>% 
  mutate(county = str_replace (name, " County, NY", "")) %>%
  select(county,percentage_high_education) %>%
  mutate(county = str_replace (county, "St.", "St")) %>%
  mutate(county = str_replace (county, "Stuben", "Steuben"))%>%
  select(county,percentage_high_education)

edu %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county percentage_high_education
Albany 0.443
Allegany 0.224
Bronx 0.209
Broome 0.289
Cattaraugus 0.200
Cayuga 0.222
Chautauqua 0.233
Chemung 0.239
Chenango 0.183
Clinton 0.254
Columbia 0.349
Cortland 0.281
Delaware 0.230
Dutchess 0.377
Erie 0.359
Essex 0.305
Franklin 0.216
Fulton 0.182
Genesee 0.220
Greene 0.256
Hamilton 0.205
Herkimer 0.214
Jefferson 0.243
Kings 0.395
Lewis 0.188
Livingston 0.289
Madison 0.262
Monroe 0.400
Montgomery 0.192
Nassau 0.475
New York 0.626
Niagara 0.263
Oneida 0.262
Onondaga 0.365
Ontario 0.375
Orange 0.314
Orleans 0.172
Oswego 0.208
Otsego 0.333
Putnam 0.415
Queens 0.339
Rensselaer 0.343
Richmond 0.349
Rockland 0.420
St Lawrence 0.236
Saratoga 0.427
Schenectady 0.333
Schoharie 0.232
Schuyler 0.243
Seneca 0.215
Steuben 0.255
Suffolk 0.383
Sullivan 0.270
Tioga 0.258
Tompkins 0.553
Ulster 0.345
Warren 0.331
Washington 0.206
Wayne 0.244
Westchester 0.506
Wyoming 0.181
Yates 0.245

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percentage_high_education: percentage of population who finished a higher education level (higher than a bachelor degree)

Ethnicity

race_non_hisp_white <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "White Non-Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_non_hisp_white") %>% 
  select (county,percent_non_hisp_white) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_non_hisp_white ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_non_hisp_black <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "Black Non-Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_non_hisp_black") %>% 
  select (county,percent_non_hisp_black) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_non_hisp_black ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_hisp_white <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "White Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_hisp_white") %>% 
  select (county,percent_hisp_white) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_hisp_white ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_hisp_black <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "Black Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_hisp_black") %>% 
  select (county,percent_hisp_black) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_hisp_black ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_merge <- race_non_hisp_white %>%
  inner_join(race_non_hisp_black, by = "county") %>%
  inner_join(race_hisp_white, by = "county") %>%
  inner_join(race_hisp_black, by = "county")

race_merge <- race_merge %>% 
  mutate (percent_other = 1 - percent_non_hisp_white - percent_non_hisp_black - percent_hisp_white - percent_hisp_black)

race_merge %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county percent_non_hisp_white percent_non_hisp_black percent_hisp_white percent_hisp_black percent_other
Albany 0.710 0.114 0.027 0.007 0.142
Allegany 0.941 0.018 0.009 0.001 0.031
Bronx 0.090 0.285 0.109 0.060 0.456
Broome 0.824 0.050 0.018 0.003 0.105
Cattaraugus 0.897 0.013 0.013 0.001 0.076
Cayuga 0.888 0.039 0.013 0.001 0.059
Chautauqua 0.865 0.023 0.033 0.005 0.074
Chemung 0.853 0.058 0.016 0.002 0.072
Chenango 0.941 0.006 0.014 0.001 0.038
Clinton 0.897 0.039 0.007 0.001 0.056
Columbia 0.848 0.039 0.022 0.002 0.089
Cortland 0.914 0.019 0.016 0.000 0.051
Delaware 0.912 0.016 0.022 0.002 0.048
Dutchess 0.699 0.099 0.053 0.007 0.141
Erie 0.746 0.124 0.020 0.004 0.107
Essex 0.912 0.029 0.010 0.003 0.046
Franklin 0.811 0.051 0.008 0.001 0.129
Fulton 0.906 0.014 0.016 0.001 0.063
Genesee 0.900 0.020 0.012 0.001 0.066
Greene 0.838 0.053 0.035 0.004 0.070
Hamilton 0.921 0.010 0.011 0.000 0.058
Herkimer 0.932 0.012 0.012 0.000 0.045
Jefferson 0.809 0.055 0.039 0.004 0.093
Kings 0.367 0.286 0.054 0.018 0.274
Lewis 0.946 0.009 0.005 0.007 0.032
Livingston 0.895 0.026 0.017 0.002 0.060
Madison 0.917 0.015 0.011 0.000 0.056
Monroe 0.698 0.140 0.041 0.009 0.113
Montgomery 0.794 0.026 0.056 0.004 0.120
Nassau 0.583 0.110 0.063 0.006 0.239
New York 0.467 0.121 0.068 0.019 0.324
Niagara 0.845 0.066 0.017 0.002 0.071
Oneida 0.811 0.057 0.027 0.003 0.101
Onondaga 0.760 0.103 0.020 0.006 0.110
Ontario 0.891 0.020 0.022 0.003 0.064
Orange 0.622 0.100 0.085 0.010 0.183
Orleans 0.857 0.056 0.021 0.001 0.065
Oswego 0.928 0.010 0.012 0.001 0.050
Otsego 0.904 0.018 0.018 0.002 0.058
Putnam 0.764 0.029 0.068 0.006 0.133
Queens 0.246 0.168 0.090 0.010 0.486
Rensselaer 0.820 0.061 0.020 0.004 0.094
Richmond 0.596 0.091 0.097 0.010 0.206
Rockland 0.629 0.108 0.065 0.007 0.191
St Lawrence 0.917 0.025 0.006 0.001 0.050
Saratoga 0.894 0.015 0.019 0.002 0.069
Schenectady 0.716 0.094 0.024 0.005 0.162
Schoharie 0.910 0.013 0.015 0.003 0.059
Schuyler 0.944 0.009 0.009 0.001 0.038
Seneca 0.879 0.050 0.018 0.002 0.051
Steuben 0.929 0.016 0.009 0.001 0.046
Suffolk 0.664 0.069 0.098 0.006 0.162
Sullivan 0.702 0.080 0.066 0.005 0.147
Tioga 0.943 0.009 0.010 0.000 0.037
Tompkins 0.769 0.035 0.023 0.005 0.168
Ulster 0.760 0.056 0.040 0.004 0.139
Warren 0.930 0.012 0.014 0.002 0.043
Washington 0.911 0.027 0.011 0.003 0.048
Wayne 0.889 0.021 0.019 0.001 0.071
Westchester 0.524 0.133 0.075 0.013 0.255
Wyoming 0.888 0.040 0.015 0.002 0.056
Yates 0.948 0.008 0.015 0.000 0.029

This has 62 rows and 6 columns of data. In which, 6 variables are:
- county: NY county name
- percent_non_hisp_white: percentage of population who are identified as Non-Hispanic White
- percent_non_hisp_black: percentage of population who are identified as Non-Hispanic Black
- percent_hisp_white: percentage of population who are identified as Hispanic White
- percent_hisp_black: percentage of population who are identified as Hispanic Black
- percent_other: percentage of population who are identified as any other ethinic groups

Household Income

income <- HHincome_NY %>%
  janitor::clean_names() %>% 
  filter (x1 == "percent_high_income") %>%
   pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_high_income") %>%
  select (county,percent_high_income) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_high_income ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_merge %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county percent_non_hisp_white percent_non_hisp_black percent_hisp_white percent_hisp_black percent_other
Albany 0.710 0.114 0.027 0.007 0.142
Allegany 0.941 0.018 0.009 0.001 0.031
Bronx 0.090 0.285 0.109 0.060 0.456
Broome 0.824 0.050 0.018 0.003 0.105
Cattaraugus 0.897 0.013 0.013 0.001 0.076
Cayuga 0.888 0.039 0.013 0.001 0.059
Chautauqua 0.865 0.023 0.033 0.005 0.074
Chemung 0.853 0.058 0.016 0.002 0.072
Chenango 0.941 0.006 0.014 0.001 0.038
Clinton 0.897 0.039 0.007 0.001 0.056
Columbia 0.848 0.039 0.022 0.002 0.089
Cortland 0.914 0.019 0.016 0.000 0.051
Delaware 0.912 0.016 0.022 0.002 0.048
Dutchess 0.699 0.099 0.053 0.007 0.141
Erie 0.746 0.124 0.020 0.004 0.107
Essex 0.912 0.029 0.010 0.003 0.046
Franklin 0.811 0.051 0.008 0.001 0.129
Fulton 0.906 0.014 0.016 0.001 0.063
Genesee 0.900 0.020 0.012 0.001 0.066
Greene 0.838 0.053 0.035 0.004 0.070
Hamilton 0.921 0.010 0.011 0.000 0.058
Herkimer 0.932 0.012 0.012 0.000 0.045
Jefferson 0.809 0.055 0.039 0.004 0.093
Kings 0.367 0.286 0.054 0.018 0.274
Lewis 0.946 0.009 0.005 0.007 0.032
Livingston 0.895 0.026 0.017 0.002 0.060
Madison 0.917 0.015 0.011 0.000 0.056
Monroe 0.698 0.140 0.041 0.009 0.113
Montgomery 0.794 0.026 0.056 0.004 0.120
Nassau 0.583 0.110 0.063 0.006 0.239
New York 0.467 0.121 0.068 0.019 0.324
Niagara 0.845 0.066 0.017 0.002 0.071
Oneida 0.811 0.057 0.027 0.003 0.101
Onondaga 0.760 0.103 0.020 0.006 0.110
Ontario 0.891 0.020 0.022 0.003 0.064
Orange 0.622 0.100 0.085 0.010 0.183
Orleans 0.857 0.056 0.021 0.001 0.065
Oswego 0.928 0.010 0.012 0.001 0.050
Otsego 0.904 0.018 0.018 0.002 0.058
Putnam 0.764 0.029 0.068 0.006 0.133
Queens 0.246 0.168 0.090 0.010 0.486
Rensselaer 0.820 0.061 0.020 0.004 0.094
Richmond 0.596 0.091 0.097 0.010 0.206
Rockland 0.629 0.108 0.065 0.007 0.191
St Lawrence 0.917 0.025 0.006 0.001 0.050
Saratoga 0.894 0.015 0.019 0.002 0.069
Schenectady 0.716 0.094 0.024 0.005 0.162
Schoharie 0.910 0.013 0.015 0.003 0.059
Schuyler 0.944 0.009 0.009 0.001 0.038
Seneca 0.879 0.050 0.018 0.002 0.051
Steuben 0.929 0.016 0.009 0.001 0.046
Suffolk 0.664 0.069 0.098 0.006 0.162
Sullivan 0.702 0.080 0.066 0.005 0.147
Tioga 0.943 0.009 0.010 0.000 0.037
Tompkins 0.769 0.035 0.023 0.005 0.168
Ulster 0.760 0.056 0.040 0.004 0.139
Warren 0.930 0.012 0.014 0.002 0.043
Washington 0.911 0.027 0.011 0.003 0.048
Wayne 0.889 0.021 0.019 0.001 0.071
Westchester 0.524 0.133 0.075 0.013 0.255
Wyoming 0.888 0.040 0.015 0.002 0.056
Yates 0.948 0.008 0.015 0.000 0.029

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_high_income: percentage of population who are at higher income households (>$75,000 annually)

Median Age

age <- Age_NY %>% 
  janitor::clean_names() %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "median_age") %>%
  select (county,median_age) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,median_age ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

age %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county median_age
Albany 38.0
Allegany 39.2
Bronx 34.8
Broome 39.7
Cattaraugus 42.3
Cayuga 43.3
Chautauqua 42.8
Chemung 41.4
Chenango 44.7
Clinton 40.7
Columbia 48.9
Cortland 36.5
Delaware 48.5
Dutchess 42.5
Erie 40.2
Essex 48.3
Franklin 40.6
Fulton 44.3
Genesee 42.6
Greene 46.6
Hamilton 56.4
Herkimer 44.3
Jefferson 32.9
Kings 35.7
Lewis 42.0
Livingston 41.7
Madison 42.5
Monroe 39.2
Montgomery 40.5
Nassau 41.8
New York 38.1
Niagara 43.3
Oneida 40.8
Onondaga 39.3
Ontario 43.6
Orange 37.0
Orleans 43.2
Oswego 40.7
Otsego 42.2
Putnam 44.2
Queens 39.6
Rensselaer 40.1
Richmond 40.4
Rockland 35.6
St Lawrence 39.0
Saratoga 43.2
Schenectady 39.9
Schoharie 45.7
Schuyler 47.1
Seneca 42.5
Steuben 42.9
Suffolk 41.7
Sullivan 42.3
Tioga 44.5
Tompkins 31.8
Ulster 44.2
Warren 46.8
Washington 44.4
Wayne 43.9
Westchester 41.2
Wyoming 42.6
Yates 40.9

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- median_age: median age of the population at each county

Sex

sex <- Sex_NY %>% 
  janitor::clean_names() %>% 
  filter (x1 == "Male:") %>%
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_male") %>%
  select (county,percent_male) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_male ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

sex%>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county percent_male
Albany 0.486
Allegany 0.506
Bronx 0.474
Broome 0.495
Cattaraugus 0.501
Cayuga 0.516
Chautauqua 0.497
Chemung 0.499
Chenango 0.502
Clinton 0.517
Columbia 0.503
Cortland 0.493
Delaware 0.507
Dutchess 0.500
Erie 0.487
Essex 0.522
Franklin 0.546
Fulton 0.501
Genesee 0.504
Greene 0.527
Hamilton 0.506
Herkimer 0.499
Jefferson 0.526
Kings 0.476
Lewis 0.504
Livingston 0.508
Madison 0.496
Monroe 0.485
Montgomery 0.496
Nassau 0.491
New York 0.476
Niagara 0.490
Oneida 0.502
Onondaga 0.485
Ontario 0.492
Orange 0.504
Orleans 0.503
Oswego 0.503
Otsego 0.488
Putnam 0.503
Queens 0.488
Rensselaer 0.497
Richmond 0.488
Rockland 0.494
St Lawrence 0.513
Saratoga 0.496
Schenectady 0.490
Schoharie 0.500
Schuyler 0.504
Seneca 0.527
Steuben 0.502
Suffolk 0.495
Sullivan 0.514
Tioga 0.499
Tompkins 0.497
Ulster 0.499
Warren 0.495
Washington 0.522
Wayne 0.500
Westchester 0.487
Wyoming 0.534
Yates 0.483

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_male: percentage of population who are identified as male

Low Birthweight Data

lowbirthweight <- lowbirthweight %>% 
  janitor::clean_names()%>%
  select (region_county, percentage)%>%
  rename (county = "region_county", percent_lowbirthweight = "percentage") 

lowbirthweight %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county percent_lowbirthweight
Nassau 7.9
Suffolk 8.0
Bronx 10.2
Kings 7.8
New York 8.0
Queens 8.7
Richmond 7.8
Dutchess 7.3
Orange 7.0
Putnam 6.6
Rockland 5.6
Sullivan 8.6
Ulster 7.7
Westchester 7.8
Albany 8.2
Columbia 6.9
Greene 6.5
Rensselaer 8.6
Saratoga 6.5
Schenectady 9.2
Fulton 8.8
Herkimer 6.4
Montgomery 8.0
Otsego 5.6
Schoharie 5.1
Clinton 8.1
Essex 7.3
Franklin 6.7
Hamilton 0.0
Warren 6.9
Washington 8.3
Jefferson 7.3
Lewis 6.1
St. Lawrence 7.2
Cayuga 7.2
Cortland 7.5
Madison 7.4
Oneida 7.9
Onondaga 8.2
Oswego 6.0
Broome 7.9
Chenango 7.6
Delaware 7.2
Tioga 7.1
Tompkins 5.7
Chemung 7.9
Livingston 7.2
Monroe 9.1
Ontario 7.0
Schuyler 2.8
Seneca 6.8
Steuben 7.2
Wayne 6.8
Yates 4.0
Allegany 6.7
Cattaraugus 7.8
Chautauqua 8.4
Erie 8.8
Genesee 6.8
Niagara 8.2
Orleans 6.5
Wyoming 8.2
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA

This has 87 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_lowbirthweight: percentage of children being born being identified as low birth weight (<2,500g) (https://www.health.ny.gov/)

Health Indicator Dataset

The following dataset is obtained from New York State Department of Health (https://www.health.ny.gov/) that contains 4 different health indicators that will complement with the low birthweight. These 5 will act as our outcomes of interest for our regression models.

health <- health %>%
  janitor::clean_names()%>%
  select (geographic_area,indicator_title,topic_area,rate_percent,measurement) %>%
  filter( str_detect (geographic_area, " County")) %>% 
  mutate (county = str_replace (geographic_area, " County", "")) %>%
  select (county, everything()) %>%
  select (-geographic_area) %>%
  filter(topic_area == "Cancer Indicators" | topic_area == "Respiratory Disease Indicators" | topic_area == "Cardiovascular Disease Indicators" | topic_area == "Maternal and Infant Health Indicators")
  
cancer <- health %>% 
  filter (topic_area == "Cancer Indicators") %>% 
  filter (indicator_title == "All cancer incidence rate per 100,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>%
  mutate(rate_percent = as.numeric(rate_percent)/10) %>%
  rename (cancer_mortality_per_10k = "rate_percent")

resp <- health %>%
  filter (topic_area == "Respiratory Disease Indicators") %>%
  filter (indicator_title == "Asthma hospitalization rate per 10,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (asthma_hosp_rate_per_10k = "rate_percent")

cardio <- health %>%
  filter (topic_area == "Cardiovascular Disease Indicators") %>%
  filter (indicator_title == "Cardiovascular disease hospitalization rate per 10,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (cardio_hosp_rate_per_10k = "rate_percent")

maternal <- health %>%
  filter (topic_area == "Maternal and Infant Health Indicators") %>%
  filter (indicator_title == "Percentage of premature births with <37 weeks gestation") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (premature_percentage = "rate_percent")
health_merge <- maternal %>% 
  inner_join(cancer, by = "county") %>%
  inner_join(resp, by = "county") %>% 
  inner_join(cardio, by = "county") 

health_merge %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county premature_percentage cancer_mortality_per_10k asthma_hosp_rate_per_10k cardio_hosp_rate_per_10k
Albany 9.7 65.59 5.7 148.9
Allegany 8.8 66.18 3.1 151.8
Broome 9.4 69.59 5.0 171.8
Cattaraugus 8.4 68.99 1.9 159.0
Cayuga 8.8 72.01 2.2 169.5
Chautauqua 9.3 73.68 2.6 90.2
Chemung 8.4 72.31 4.2 161.9
Chenango 8.2 73.57 2.0 177.5
Clinton 8.7 68.46 2.5 147.1
Columbia 9.0 82.23 4.7 168.6
Cortland 8.9 68.34 4.7 158.7
Delaware 8.7 75.42 2.8 181.5
Dutchess 9.0 62.68 6.5 141.6
Erie 10.2 71.48 5.1 159.5
Essex 8.9 72.62 0.5* 105.8
Franklin 8.5 67.53 1.9 117.5
Fulton 9.0 77.05 1.7 161.5
Genesee 9.3 72.36 2.6 166.9
Greene 8.8 79.08 4.8 192.3
Hamilton s NA 0.0* 175.2
Herkimer 7.5 70.15 2.6 178.2
Jefferson 8.8 59.69 3.8 142.9
Lewis 7.0 65.46 3.0 154.7
Livingston 7.8 68.43 2.2 127.8
Madison 9.7 62.76 1.8 122.3
Monroe 10.0 65.74 7.4 152.4
Montgomery 9.0 76.86 3.0 200.1
Nassau 9.2 67.06 7.1 169.5
Niagara 10.0 78.86 3.8 182.8
Oneida 9.8 69.56 3.7 176.6
Onondaga 9.2 66.68 5.2 145.8
Ontario 9.1 76.79 2.2 147.1
Orange 8.3 53.48 6.7 141.8
Orleans 9.6 76.96 1.6 158.1
Oswego 7.9 69.43 3.3 161.8
Otsego 7.5 68.94 2.5 178.7
Putnam 8.7 64.47 4.6 114.9
Rensselaer 10.2 66.05 3.5 148.5
Rockland 6.3 54.49 4.5 104.4
St. Lawrence 8.7 62.57 3.2 157.0
Saratoga 8.1 68.64 2.6 130.9
Schenectady 10.3 64.28 5.0 153.4
Schoharie 8.5 69.63 2.5 142.6
Schuyler 5.2 77.66 1.5* 138.4
Seneca 8.4 70.07 1.7 142.1
Steuben 8.0 69.41 2.7 135.1
Suffolk 10.0 68.67 5.1 173.8
Sullivan 10.3 61.70 5.8 181.8
Tioga 8.6 76.21 2.1 84.3
Tompkins 6.5 49.56 1.7 80.4
Ulster 9.0 66.57 5.1 162.6
Warren 10.2 80.66 4.3 173.6
Washington 11.1 75.01 3.8 161.6
Wayne 8.8 72.46 4.3 191.0
Westchester 9.1 60.86 7.1 136.8
Wyoming 10.3 75.61 3.5 155.1
Yates 6.0 73.05 2.0 130.7
Bronx 10.3 45.87 28.4 172.5
Kings 8.5 47.96 10.7 132.1
New York 8.7 53.95 8.4 117.3
Queens 9.1 51.60 7.8 141.7
Richmond 9.2 65.15 7.9 160.8

They are:
- cancer_mortality_per_10k: percentage of cancer mortality per 100 thousands people in each NY county (Cancer Indicator)
- asthma_hosp_rate_per_10k: percentage of asthma hospitalization per 10 thousands people in each NY county (Respiratory Disease Indicator)
- cardio_hosp_rate_per_10k“: percentage of cardiovascular-disease-related hospitalization per 10 thousands people in each NY county (Cardiovascular Disease Indicator)
- premature_percentage: percentage of children being born prematurely (<37 gestational weeks) in each NY county

Merge dataset

Here we perform inner_join() to create 1 bigger dataset called demographic_merge. Then, we join them with our lowbirthweight & pm2_5 to make a finalized data frame called merge. And, we will use this for regression model.

demographic_merge <- age %>%
  inner_join(sex,  by = "county") %>%
  inner_join(income, by = "county") %>% 
  inner_join(race_merge, by = "county") %>% 
  inner_join(edu, by = "county") %>%
  mutate(county = str_replace (county, "St ", "St. "))
  

merge <- lowbirthweight %>% 
  inner_join(pm2_5, by = "county") %>%
  inner_join(health_merge, by = "county") %>% 
  inner_join(demographic_merge, by = "county") 

merge <- merge %>% 
  select (county, annual_pm2.5,everything())%>%
  mutate(asthma_hosp_rate_per_10k = as.numeric(asthma_hosp_rate_per_10k))%>%
  mutate(cardio_hosp_rate_per_10k = as.numeric(cardio_hosp_rate_per_10k)) %>%
  mutate(premature_percentage = as.numeric(premature_percentage))%>%
  mutate(cancer_mortality_per_10k = as.numeric(cancer_mortality_per_10k))

merge %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
county annual_pm2.5 percent_lowbirthweight premature_percentage cancer_mortality_per_10k asthma_hosp_rate_per_10k cardio_hosp_rate_per_10k median_age percent_male percent_high_income percent_non_hisp_white percent_non_hisp_black percent_hisp_white percent_hisp_black percent_other percentage_high_education
Nassau 8.7 7.9 9.2 67.06 7.1 169.5 41.8 0.491 0.712 0.583 0.110 0.063 0.006 0.239 0.475
Suffolk 6.8 8.0 10.0 68.67 5.1 173.8 41.7 0.495 0.669 0.664 0.069 0.098 0.006 0.162 0.383
Bronx 8.3 10.2 10.3 45.87 28.4 172.5 34.8 0.474 0.296 0.090 0.285 0.109 0.060 0.456 0.209
Kings 7.9 7.8 8.5 47.96 10.7 132.1 35.7 0.476 0.466 0.367 0.286 0.054 0.018 0.274 0.395
New York 10.4 8.0 8.7 53.95 8.4 117.3 38.1 0.476 0.572 0.467 0.121 0.068 0.019 0.324 0.626
Queens 8.3 8.7 9.1 51.60 7.8 141.7 39.6 0.488 0.505 0.246 0.168 0.090 0.010 0.486 0.339
Richmond 7.8 7.8 9.2 65.15 7.9 160.8 40.4 0.488 0.567 0.596 0.091 0.097 0.010 0.206 0.349
Dutchess 7.7 7.3 9.0 62.68 6.5 141.6 42.5 0.500 0.561 0.699 0.099 0.053 0.007 0.141 0.377
Orange 6.4 7.0 8.3 53.48 6.7 141.8 37.0 0.504 0.558 0.622 0.100 0.085 0.010 0.183 0.314
Putnam 7.9 6.6 8.7 64.47 4.6 114.9 44.2 0.503 0.672 0.764 0.029 0.068 0.006 0.133 0.415
Rockland 8.8 5.6 6.3 54.49 4.5 104.4 35.6 0.494 0.612 0.629 0.108 0.065 0.007 0.191 0.420
Sullivan 7.2 8.6 10.3 61.70 5.8 181.8 42.3 0.514 0.418 0.702 0.080 0.066 0.005 0.147 0.270
Ulster 7.4 7.7 9.0 66.57 5.1 162.6 44.2 0.499 0.476 0.760 0.056 0.040 0.004 0.139 0.345
Westchester 9.3 7.8 9.1 60.86 7.1 136.8 41.2 0.487 0.624 0.524 0.133 0.075 0.013 0.255 0.506
Albany 7.9 8.2 9.7 65.59 5.7 148.9 38.0 0.486 0.494 0.710 0.114 0.027 0.007 0.142 0.443
Columbia 7.0 6.9 9.0 82.23 4.7 168.6 48.9 0.503 0.489 0.848 0.039 0.022 0.002 0.089 0.349
Greene 6.8 6.5 8.8 79.08 4.8 192.3 46.6 0.527 0.419 0.838 0.053 0.035 0.004 0.070 0.256
Rensselaer 7.5 8.6 10.2 66.05 3.5 148.5 40.1 0.497 0.514 0.820 0.061 0.020 0.004 0.094 0.343
Saratoga 7.7 6.5 8.1 68.64 2.6 130.9 43.2 0.496 0.591 0.894 0.015 0.019 0.002 0.069 0.427
Schenectady 7.5 9.2 10.3 64.28 5.0 153.4 39.9 0.490 0.472 0.716 0.094 0.024 0.005 0.162 0.333
Fulton 6.7 8.8 9.0 77.05 1.7 161.5 44.3 0.501 0.364 0.906 0.014 0.016 0.001 0.063 0.182
Herkimer 6.5 6.4 7.5 70.15 2.6 178.2 44.3 0.499 0.410 0.932 0.012 0.012 0.000 0.045 0.214
Montgomery 6.9 8.0 9.0 76.86 3.0 200.1 40.5 0.496 0.345 0.794 0.026 0.056 0.004 0.120 0.192
Otsego 6.4 5.6 7.5 68.94 2.5 178.7 42.2 0.488 0.400 0.904 0.018 0.018 0.002 0.058 0.333
Schoharie 6.5 5.1 8.5 69.63 2.5 142.6 45.7 0.500 0.423 0.910 0.013 0.015 0.003 0.059 0.232
Clinton 7.0 8.1 8.7 68.46 2.5 147.1 40.7 0.517 0.408 0.897 0.039 0.007 0.001 0.056 0.254
Essex 3.4 7.3 8.9 72.62 NA 105.8 48.3 0.522 0.408 0.912 0.029 0.010 0.003 0.046 0.305
Franklin 7.4 6.7 8.5 67.53 1.9 117.5 40.6 0.546 0.358 0.811 0.051 0.008 0.001 0.129 0.216
Hamilton 6.0 0.0 NA NA NA 175.2 56.4 0.506 0.432 0.921 0.010 0.011 0.000 0.058 0.205
Warren 7.2 6.9 10.2 80.66 4.3 173.6 46.8 0.495 0.466 0.930 0.012 0.014 0.002 0.043 0.331
Washington 7.3 8.3 11.1 75.01 3.8 161.6 44.4 0.522 0.416 0.911 0.027 0.011 0.003 0.048 0.206
Jefferson 6.8 7.3 8.8 59.69 3.8 142.9 32.9 0.526 0.377 0.809 0.055 0.039 0.004 0.093 0.243
Lewis 6.1 6.1 7.0 65.46 3.0 154.7 42.0 0.504 0.390 0.946 0.009 0.005 0.007 0.032 0.188
St. Lawrence 7.7 7.2 8.7 62.57 3.2 157.0 39.0 0.513 0.356 0.917 0.025 0.006 0.001 0.050 0.236
Cayuga 6.4 7.2 8.8 72.01 2.2 169.5 43.3 0.516 0.397 0.888 0.039 0.013 0.001 0.059 0.222
Cortland 6.1 7.5 8.9 68.34 4.7 158.7 36.5 0.493 0.404 0.914 0.019 0.016 0.000 0.051 0.281
Madison 6.1 7.4 9.7 62.76 1.8 122.3 42.5 0.496 0.417 0.917 0.015 0.011 0.000 0.056 0.262
Oneida 6.5 7.9 9.8 69.56 3.7 176.6 40.8 0.502 0.409 0.811 0.057 0.027 0.003 0.101 0.262
Onondaga 5.1 8.2 9.2 66.68 5.2 145.8 39.3 0.485 0.443 0.760 0.103 0.020 0.006 0.110 0.365
Oswego 6.4 6.0 7.9 69.43 3.3 161.8 40.7 0.503 0.416 0.928 0.010 0.012 0.001 0.050 0.208
Broome 6.9 7.9 9.4 69.59 5.0 171.8 39.7 0.495 0.362 0.824 0.050 0.018 0.003 0.105 0.289
Chenango 6.5 7.6 8.2 73.57 2.0 177.5 44.7 0.502 0.364 0.941 0.006 0.014 0.001 0.038 0.183
Delaware 6.6 7.2 8.7 75.42 2.8 181.5 48.5 0.507 0.334 0.912 0.016 0.022 0.002 0.048 0.230
Tioga 6.6 7.1 8.6 76.21 2.1 84.3 44.5 0.499 0.428 0.943 0.009 0.010 0.000 0.037 0.258
Tompkins 6.2 5.7 6.5 49.56 1.7 80.4 31.8 0.497 0.427 0.769 0.035 0.023 0.005 0.168 0.553
Chemung 6.2 7.9 8.4 72.31 4.2 161.9 41.4 0.499 0.377 0.853 0.058 0.016 0.002 0.072 0.239
Livingston 6.7 7.2 7.8 68.43 2.2 127.8 41.7 0.508 0.428 0.895 0.026 0.017 0.002 0.060 0.289
Monroe 7.4 9.1 10.0 65.74 7.4 152.4 39.2 0.485 0.444 0.698 0.140 0.041 0.009 0.113 0.400
Ontario 6.6 7.0 9.1 76.79 2.2 147.1 43.6 0.492 0.474 0.891 0.020 0.022 0.003 0.064 0.375
Schuyler 6.0 2.8 5.2 77.66 NA 138.4 47.1 0.504 0.411 0.944 0.009 0.009 0.001 0.038 0.243
Seneca 6.4 6.8 8.4 70.07 1.7 142.1 42.5 0.527 0.373 0.879 0.050 0.018 0.002 0.051 0.215
Steuben 4.8 7.2 8.0 69.41 2.7 135.1 42.9 0.502 0.384 0.929 0.016 0.009 0.001 0.046 0.255
Wayne 6.7 6.8 8.8 72.46 4.3 191.0 43.9 0.500 0.424 0.889 0.021 0.019 0.001 0.071 0.244
Yates 6.0 4.0 6.0 73.05 2.0 130.7 40.9 0.483 0.372 0.948 0.008 0.015 0.000 0.029 0.245
Allegany 6.5 6.7 8.8 66.18 3.1 151.8 39.2 0.506 0.342 0.941 0.018 0.009 0.001 0.031 0.224
Cattaraugus 7.3 7.8 8.4 68.99 1.9 159.0 42.3 0.501 0.335 0.897 0.013 0.013 0.001 0.076 0.200
Chautauqua 6.6 8.4 9.3 73.68 2.6 90.2 42.8 0.497 0.323 0.865 0.023 0.033 0.005 0.074 0.233
Erie 7.6 8.8 10.2 71.48 5.1 159.5 40.2 0.487 0.419 0.746 0.124 0.020 0.004 0.107 0.359
Genesee 7.3 6.8 9.3 72.36 2.6 166.9 42.6 0.504 0.419 0.900 0.020 0.012 0.001 0.066 0.220
Niagara 8.7 8.2 10.0 78.86 3.8 182.8 43.3 0.490 0.404 0.845 0.066 0.017 0.002 0.071 0.263
Orleans 7.3 6.5 9.6 76.96 1.6 158.1 43.2 0.503 0.361 0.857 0.056 0.021 0.001 0.065 0.172
Wyoming 7.0 8.2 10.3 75.61 3.5 155.1 42.6 0.534 0.391 0.888 0.040 0.015 0.002 0.056 0.181

This merge dataset is comprised of 62 rows and 16 columns of data, with first column being the name of the county, and other 15 beings all continuous variables of identified indicators of the whole 62 counties of NY state.

Making Map File

uscounties <- uscounties %>% 
  filter (state_id == "NY") %>%
  select (county, lat, lng)

map <- merge %>% 
  inner_join(uscounties, by ="county")

write.csv(map, "NY_map.csv", row.names = FALSE)

This file is specifically made for the purpose of making map (in Shiny App). We obtained latitude and longitude data of all NY counties from SimpleMaps Database (https://simplemaps.com/data/us-counties).

EDA

Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.

Descriptive Summary

First, we want to understand all the chosen variables we have. Since they are all continuous, we want to look at how many data points are being missing, their mean, median, and standard deviation. The code chunk below summarizes all of that statistics from our merge dataset.

merge_cont =
  merge %>% 
  select(-county)

skim_table <- skim(merge_cont)
numeric_summary <- skim_table %>%
  select(Variables = skim_variable, Nmiss = n_missing, Mean = numeric.mean, Median = numeric.p50, STD = numeric.sd)


kable(numeric_summary, "html", digits = 1) %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 13) %>%
  column_spec(1, width = "3cm") %>%
  column_spec(2, width = "3cm") %>%
  column_spec(3, width = "3cm") %>%
  column_spec(4, width = "3cm") %>%
  column_spec(5, width = "2cm")
Variables Nmiss Mean Median STD
annual_pm2.5 0 7.0 6.8 1.1
percent_lowbirthweight 0 7.2 7.3 1.5
premature_percentage 1 8.8 8.9 1.1
cancer_mortality_per_10k 1 68.0 68.9 8.0
asthma_hosp_rate_per_10k 3 4.4 3.7 3.8
cardio_hosp_rate_per_10k 0 151.1 154.9 25.9
median_age 0 41.9 42.1 3.9
percent_male 0 0.5 0.5 0.0
percent_high_income 0 0.4 0.4 0.1
percent_non_hisp_white 0 0.8 0.9 0.2
percent_non_hisp_black 0 0.1 0.0 0.1
percent_hisp_white 0 0.0 0.0 0.0
percent_hisp_black 0 0.0 0.0 0.0
percent_other 0 0.1 0.1 0.1
percentage_high_education 0 0.3 0.3 0.1

The table provides descriptive statistics for various variables in the dataset. Key insights include the average annual_pm2.5 level being 7.0, with a small standard deviation of 1.1, indicating relatively consistent air quality. Additionally, variables such as premature_percentage and cancer_mortality_per_10k show slight variability with a small number of missing values, while hospitalization rates and demographic characteristics exhibit diverse distributions. Overall, the summary highlights the central tendency, variability, and distribution characteristics of the dataset’s variables.

Percentage of High Income

Hover over the interactive graph or zoom in for more detailed information.

Then, we aimed to look that the distribution of percent_high_income across all counties. We performed a general barplot that allowed us to understand the overall distribution.

merge %>%
 plot_ly(
    x = ~reorder(county, percent_high_income),
    y = ~percent_high_income,
    type = "bar",
    marker = list(color = "red1")
  ) %>%
  layout(
    title = "Percentage of High Income",
    xaxis = list(title = "County Name", categoryorder = "total descending"),
    yaxis = list(title = "Percentage"),
    barmode = "stack"
  )

The bar chart depicts the “Percentage of High Income” across 62 counties of New York State, ordered from the highest to the lowest percentage. Nassau county has the highest percentage of household with an annual income of more than $75,000 (categorized as high income), while that of the Bronx county is the lowest.

Racial Composition

Hover over the interactive graph or zoom in for more detailed information.

Due to the nature of variety of racial group indicators of each county, we wanted to make a stacked bargraph to see the different between each county’s racial composition.

race_plot <- merge %>% 
  select (county, percent_non_hisp_white, percent_non_hisp_black, percent_hisp_white, percent_hisp_black, percent_other) %>%
  pivot_longer(
    cols = starts_with("percent_"), names_to = "race", values_to = "percentage") 

race_plot%>%
  plot_ly(x = ~county, y = ~percentage, type = "bar",color = ~race,colors = "RdYlGn", hoverinfo = "y+name") %>% 
  layout(barmode = "stack",
         title = "Racial Composition in NY county",
         xaxis = list(title = "County"),
         yaxis = list(title = "Percentage (%)"))

This stacked bar chart displays the racial demographics across 62 New York counties. Each bar represents a county and is divided into segments that reflect the percentage of the county’s population of the following racial categories: non-Hispanic White, non-Hispanic Black, Hispanic White, Hispanic Black, and Other. The majority of New York State population is non-Hispanic-White, which is also the major racial group in almost all counties, except in Bronx county and Queens county.

5 Outcome Graphs

The prevalence of our 5 outcomes of interest across 62 counties are illustrated in these 5 graphs below. For each outcome, we have highlighted the TOP 5 to be circled with red color, while the bottom 5 will be circled with blue color, for easier understanding as there are many data points for NY State.

Low birthweight Rate

Hover over the interactive graph or zoom in for more detailed information.

# Create a scatter plot
LBR_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~percent_lowbirthweight,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Low Birthweight",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Percentage of Low Birthweight"),
    showlegend = FALSE
  )

# Identify the top and bottom points
top_points <- merge %>% arrange(desc(percent_lowbirthweight)) %>% slice_head(n = 5)
bottom_points <- merge %>% arrange(percent_lowbirthweight) %>% slice_head(n = 5)

# Identify the other points
suppressMessages({other_points <- merge %>%
  anti_join(top_points) %>%
  anti_join(bottom_points)})

# Add markers for the top and bottom points
LBR_graph <- LBR_graph %>%
  add_markers(
    data = top_points,
    x = ~county,
    y = ~percent_lowbirthweight,
    color = I("red"),  # Customize color for top points
    size = I(10),      # Customize marker size for top points
    name = "Top 5"
  ) %>%
  add_markers(
    data = bottom_points,
    x = ~county,
    y = ~percent_lowbirthweight,
    color = I("blue"),  # Customize color for bottom points
    size = I(10),       # Customize marker size for bottom points
    name = "Bottom 5") %>%
  add_markers(
    data = other_points,
    x = ~county,
    y = ~percent_lowbirthweight,
    color = I("grey"),  # Customize color for other points
    size = I(5),        # Customize marker size for other points
    name = "Other Points"
  )

# Display the plot
LBR_graph

TOP 5: (Bronx, 10.2), (Schenectady, 9.2), (Monroe, 9.1), (Fulton, 8.8), (Erie, 8.8) .

BOTTOM 5: (Hamilton,0), (Schuyler,2.8), (Yates, 4), (Schoharie, 5.1), (Rockland, 5.6).

Premature Birth

Hover over the interactive graph or zoom in for more detailed information.

# Create a scatter plot
premature_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~premature_percentage,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Premature Birth",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Percentage of Premature Birth"),
    showlegend = FALSE
  )

# Identify the top and bottom points
top_points1 <- merge %>% arrange(desc(premature_percentage)) %>% slice_head(n = 5)
bottom_points1 <- merge %>% arrange(premature_percentage) %>% slice_head(n = 5)

# Identify the other points
suppressMessages({other_points1 <- merge %>%
  anti_join(top_points1) %>%
  anti_join(bottom_points1)})

# Add markers for the top and bottom points
premature_graph <- premature_graph %>%
  add_markers(
    data = top_points1,
    x = ~county,
    y = ~premature_percentage,
    color = I("red"),  # Customize color for top points
    size = I(10),      # Customize marker size for top points
    name = "Top 5"
  ) %>%
  add_markers(
    data = bottom_points1,
    x = ~county,
    y = ~premature_percentage,
    color = I("blue"),  # Customize color for bottom points
    size = I(10),       # Customize marker size for bottom points
    name = "Bottom 5") %>%
  add_markers(
    data = other_points1,
    x = ~county,
    y = ~premature_percentage,
    color = I("grey"),  # Customize color for other points
    size = I(5),        # Customize marker size for other points
    name = "Other Points"
  )

# Display the plot
premature_graph

TOP 5: (Washington, 11.1), (Schenectady, 10.3), (Bronx, 10.3), (Sullivan, 10.3), (Wyoming, 10.3) .

BOTTOM 5: (Schuyler,5.2), (Yates , 6), (Rockland, 6.3), (Tompskin, 6.5), (Lewis, 7).

Asthma Hospitalization

Hover over the interactive graph or zoom in for more detailed information.

# Create a scatter plot
asthma_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~asthma_hosp_rate_per_10k,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Asthma Hospitalization",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 10k"),
    showlegend = FALSE
  )

# Identify the top and bottom points
top_points2 <- merge %>% arrange(desc(asthma_hosp_rate_per_10k)) %>% slice_head(n = 5)
bottom_points2 <- merge %>% arrange(asthma_hosp_rate_per_10k) %>% slice_head(n = 5)

# Identify the other points
suppressMessages({other_points2 <- merge %>%
  anti_join(top_points2) %>%
  anti_join(bottom_points2)})

# Add markers for the top and bottom points
asthma_graph <- asthma_graph %>%
  add_markers(
    data = top_points2,
    x = ~county,
    y = ~asthma_hosp_rate_per_10k,
    color = I("red"),  # Customize color for top points
    size = I(10),      # Customize marker size for top points
    name = "Top 5"
  ) %>%
  add_markers(
    data = bottom_points2,
    x = ~county,
    y = ~asthma_hosp_rate_per_10k,
    color = I("blue"),  # Customize color for bottom points
    size = I(10),       # Customize marker size for bottom points
    name = "Bottom 5") %>%
  add_markers(
    data = other_points2,
    x = ~county,
    y = ~asthma_hosp_rate_per_10k,
    color = I("grey"),  # Customize color for other points
    size = I(5),        # Customize marker size for other points
    name = "Other Points"
  )

# Display the plot
asthma_graph

TOP 5: (Bronx, 28.4), (Kings, 10.7), (New York, 8.4), (Richmond, 7.9), (Queens, 7.8) .

BOTTOM 5: (Orleans,1.6), (Seneca , 1.7), (Tompskin, 1.7), (Fulton, 1.7), (Madison, 1.8).

Cancer Mortality Rate

Hover over the interactive graph or zoom in for more detailed information.

# Create a scatter plot
cancer_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~cancer_mortality_per_10k,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Cancer Mortality",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 10k"),
    showlegend = FALSE
  )

# Identify the top and bottom points
top_points3 <- merge %>% arrange(desc(cancer_mortality_per_10k)) %>% slice_head(n = 5)
bottom_points3 <- merge %>% arrange(cancer_mortality_per_10k) %>% slice_head(n = 5)

# Identify the other points
suppressMessages({other_points3 <- merge %>%
  anti_join(top_points3) %>%
  anti_join(bottom_points3)})

# Add markers for the top and bottom points
cancer_graph <- cancer_graph %>%
  add_markers(
    data = top_points3,
    x = ~county,
    y = ~cancer_mortality_per_10k,
    color = I("red"),  # Customize color for top points
    size = I(10),      # Customize marker size for top points
    name = "Top 5"
  ) %>%
  add_markers(
    data = bottom_points3,
    x = ~county,
    y = ~cancer_mortality_per_10k,
    color = I("blue"),  # Customize color for bottom points
    size = I(10),       # Customize marker size for bottom points
    name = "Bottom 5") %>%
  add_markers(
    data = other_points3,
    x = ~county,
    y = ~cancer_mortality_per_10k,
    color = I("grey"),  # Customize color for other points
    size = I(5),        # Customize marker size for other points
    name = "Other Points"
  )

# Display the plot
cancer_graph

TOP 5: (Columbia, 82.23), (Warren, 80.66), (Greene, 79.08), (Niagara, 78.86), (Schuyler, 77.66) .

BOTTOM 5: (Bronx ,45.87), (Kings , 47.96), (Tompskin, 49.56), (Queens, 51.6), (Orange, 53.48) .

Cardiovascular Rate

Hover over the interactive graph or zoom in for more detailed information.

# Create a scatter plot
cardio_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~cardio_hosp_rate_per_10k,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Cardiovascular Disease Hospitalization",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 10k"),
    showlegend = FALSE
  )

# Identify the top and bottom points
top_points4 <- merge %>% arrange(desc(cardio_hosp_rate_per_10k)) %>% slice_head(n = 5)
bottom_points4 <- merge %>% arrange(cardio_hosp_rate_per_10k) %>% slice_head(n = 5)

# Identify the other points
suppressMessages({other_points4 <- merge %>%
  anti_join(top_points4) %>%
  anti_join(bottom_points4)})

# Add markers for the top and bottom points
cardio_graph <- cardio_graph %>%
  add_markers(
    data = top_points4,
    x = ~county,
    y = ~cardio_hosp_rate_per_10k,
    color = I("red"),  # Customize color for top points
    size = I(10),      # Customize marker size for top points
    name = "Top 5"
  ) %>%
  add_markers(
    data = bottom_points4,
    x = ~county,
    y = ~cardio_hosp_rate_per_10k,
    color = I("blue"),  # Customize color for bottom points
    size = I(10),       # Customize marker size for bottom points
    name = "Bottom 5") %>%
  add_markers(
    data = other_points4,
    x = ~county,
    y = ~cardio_hosp_rate_per_10k,
    color = I("grey"),  # Customize color for other points
    size = I(5),        # Customize marker size for other points
    name = "Other Points"
  )

# Display the plot
cardio_graph

TOP 5: (Montgomery, 200.1), (Greene, 192.3), (Wayne, 191), (Niagara, 182.8), (Sullivan, 181.8) .

BOTTOM 5: (Tompskin ,80.4), (Tioga , 84.3), (Chautauqua, 90.2), (Rockland, 104.4), (Essex, 105.8) .

Statistical Analysis

If you undertake formal statistical analyses, describe these in detail

Method

After cleaning the data, we then look at the effect that the annual concentration of pm 2.5 separately has on multiple outcomes among 62 counties in New York, including low birth weight rate, premature birth rate, cancer mortality rate, asthma-related hospitalization rate, and cardiovascular-disease-related hospitalization rate.

Since all of the outcomes are measured as rates (i.e., continuous), we use linear regression (LR) models to assess the effect of annual concentration of pm 2.5 on the outcomes of interests, adjusting for age, sex, ethnicity, education, income. In our models:

  • Age is reported as a continuous variable that reflects the median age of each county.

  • Sex is reported as a continuous variable that reflects the percentage of male in each county. The variable that reflects the percentage of female in each county is not included in our model to avoid perfect multicollinearity.

  • Ethnicity is reported as a continuous variable that reflects the percentage of different ethnicity groups in each county, including Hispanic-Black, Hispanic-White, non-Hispanic-Black, non-Hispanic-White, and Others (if Asian, Native American, or belong to two or more race). The variable that reflects the percentage of people who belong to the category Others is not included in our model to avoid perfect multicollinearity.

  • Education is reported as a continuous variable that reflects the percentage of people who have obtained one or more higher education degrees in each county.

  • Income is reported as a continuous variable that reflects the percentage of household of each county that have annual income exceeding $75,000.

For each of the five outcomes of interest, we first start with the full model:

Outcome = Annual pm 2.5 concentration + median age + percentage of male + percentage of Hispanic-Black + percentage of Hispanic-White + percentage of non-Hispanic-Black + percentage of non-Hispanic-White + percentage of household income exceeding 75,000 USD + percentage of people obtained higher education

We then perform bidirectional stepwise selection on all five models and output the best models. The full model and the resulting best models after stepwise selection for each outcome are detailed below:

LR models by Outcomes

Low Birth Weight Rate

Stepwise regression

Full model

lm_pm2.5_birthweight_adjusted <-lm(percent_lowbirthweight ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)

Stepwise selection

lm_pm2.5_birthweight_adjusted_best <- step(lm_pm2.5_birthweight_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_birthweight_adjusted_best)%>%
  tab_model()
  percent lowbirthweight
Predictors Estimates CI p
(Intercept) 11.53 7.18 – 15.88 <0.001
median age -0.11 -0.21 – -0.02 0.024
percent non hisp black 8.05 1.34 – 14.76 0.019
Observations 62
R2 / R2 adjusted 0.274 / 0.249

Interpretation

The best model that was outputted by the stepwise regression is

low birth weight = 11.53 - 0.11(median age) + 8.05 (percent_non_hisp_black)

  • On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 11.5% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.

  • On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

  • On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.05%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.

The adjusted R-squared of 0.249 implies that 24.9% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).

Premature Birth Weight Rate

Stepwise regression

Full model

lm_pm2.5_premature_adjusted <-lm(premature_percentage ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)

Stepwise selection

lm_pm2.5_premature_adjusted_best <- step(lm_pm2.5_premature_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_premature_adjusted_best)%>%
  tab_model()
  premature percentage
Predictors Estimates CI p
(Intercept) 4.88 0.96 – 8.80 0.016
median age 0.08 -0.01 – 0.17 0.067
percent non hisp black 8.03 2.63 – 13.42 0.004
Observations 61
R2 / R2 adjusted 0.135 / 0.105

Interpretation

The best model that was outputted by the stepwise regression is

premature birth weight = 4.88 + 0.08(median age) + 8.03 (percent_non_hisp_black)

  • On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 4.88% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.

  • On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

  • On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.

The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).

Cancer Mortality Rate

Detailed below are the the initial model, the final model and the interpretation for each outcome.

Stepwise regression

Full model

lm_pm2.5_cancer_adjusted <-lm(cancer_mortality_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)

Stepwise selection

lm_pm2.5_cancer_adjusted_best<-step(lm_pm2.5_cancer_adjusted, direction = 'both', trace=FALSE)

summary (lm_pm2.5_cancer_adjusted_best)%>%
  tab_model()
  cancer mortality per 10 k
Predictors Estimates CI p
(Intercept) -0.18 -51.73 – 51.36 0.994
annual pm2 5 1.33 0.10 – 2.55 0.035
median age 1.31 0.98 – 1.65 <0.001
percent high income -21.35 -37.32 – -5.38 0.010
percent non hisp white 46.40 20.09 – 72.71 0.001
percent non hisp black 53.63 2.27 – 104.99 0.041
percent hisp white 102.38 10.35 – 194.41 0.030
percent hisp black -178.74 -417.28 – 59.79 0.139
percent male -57.41 -140.92 – 26.09 0.174
Observations 61
R2 / R2 adjusted 0.809 / 0.780

Interpretation

The best model that was outputted by the stepwise regression is

cancer mortality rate = -1.85 + 13.27(annual pm 2.5) + 13.14(median age) - 213.49(percent high income) + 464.03(percent non hisp white) + 536.28(percent non hisp black) + 1023.82(percent hisp white) - 1787.43(percent hisp black) - 574.15(percent male)

  • On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has 0% of high income, has 0% of non-Hispanic-Black, has 0% of non-Hispanic-White, has 0% of Hispanic-Black, has 0% of Hispanic-White, has 0% of male, the expected low birth weight rate is -1.85 case per 100,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.

  • On average, for every unit increase of annual pm 2.5 concentration, the expected cancer mortality rate of the county increases by 13.27 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.

  • On average, for every unit increase of median age (1 year), the expected cancer mortality rate of the county decreases by 13.14 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

  • On average, for every unit increase in the percentage of high income, the expected cancer mortality rate of the county decreases by 213.49 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of high income is significantly different than 0.

  • On average, for every unit increase of percentage of non-Hispanic-White, the expected cancer mortality rate of the county increases by 464.03 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-White is significantly different than 0.

  • On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 536.28 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.

  • On average, for every unit increase of percentage Hispanic-White, the expected cancer mortality rate of the county increases by 1023.08 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of Hispanic-White is significantly different than 0.

  • On average, for every unit increase of percentage Hispanic-Black, the expected cancer mortality rate of the county decreases by 1787.43 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient percentage of Hispanic-Black is significantly different than 0.

  • On average, for every unit increase of percentage of male, the expected cancer mortality rate of the county decreases by 574.15 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percentage of male is significantly different than 0.

The adjusted R-squared of 0.780 implies that 70.8% of the variation in the response variable can be explained by its linear relationship with the set of the 8 predictors (annual pm 2.5 conc., median age, percentage of high income, percentage of 4 ethnicity groups, percentage of male).

Asthma Hospitalization Rate

Detailed below are the the initial model, the final model and the interpretation for each outcome.

Stepwise regression

Full model

lm_pm2.5_asthma_adjusted <-lm(asthma_hosp_rate_per_10k ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)

Stepwise selection

lm_pm2.5_asthma_adjusted_best <- step(lm_pm2.5_asthma_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_asthma_adjusted_best)%>%
  tab_model()
  asthma hosp rate per 10 k
Predictors Estimates CI p
(Intercept) 1.36 0.06 – 2.67 0.041
percent high income 4.65 0.32 – 8.99 0.036
percent non hisp black 13.52 5.77 – 21.26 0.001
percent hisp black 366.67 314.84 – 418.50 <0.001
percentage high education -5.31 -9.55 – -1.06 0.015
Observations 59
R2 / R2 adjusted 0.935 / 0.930

Interpretation

The best model that was outputted by the stepwise regression is

asthma hospitalization rate = 1.36 + 4.65(percent high income) + 13.52(percent non hisp black) + 366.67 (percent hisp black) - 5.31(percent high education)

  • On average, for a county who has zero percent of high income, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-Black, has zero percent of high education, the expected asthma hospitalization rate is 1.85 case per 10,000 people (no practical meaning). Since the p-value is smaller han 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.

  • On average, for every unit increase in the percentage of high income, the expected asthma hospitalization rate of the county increases by 4.65 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of high incomeis significantly different than 0.

  • On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 13.52 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of non-Hispanic-Black is significantly different than 0.

  • On average, for every unit increase of percentage of Hispanic-Black, the expected cancer mortality rate of the county increases by 366.67 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of Hispanic-Black is significantly different than 0.

  • On average, for every unit increase of percentage of high education, the expected cancer mortality rate of the county decreases by 5.31 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of higher education is significantly different than 0.

The adjusted R-squared of 0.930 implies that 93.0% of the variation in the response variable can be explained by its linear relationship with the set of the 4 predictors (percentage of high income, percentage of non-Hispanic-Black, percentage of Hispanic-Black, percentage of higher education).

Cardiovascular Disease Rate

Detailed below are the the initial model, the final model and the interpretation for each outcome.

Stepwise regression

Full model

lm_pm2.5_cardio_adjusted <-lm(cardio_hosp_rate_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)

Stepwise selection

lm_pm2.5_cardio_adjusted_best<-step(lm_pm2.5_cardio_adjusted, direction = 'both', trace=FALSE)

summary (lm_pm2.5_cardio_adjusted_best)%>%
  tab_model()
  cardio hosp rate per 10 k
Predictors Estimates CI p
(Intercept) 157.52 -155.87 – 470.91 0.318
annual pm2 5 6.97 -0.14 – 14.08 0.055
median age 1.73 0.01 – 3.44 0.049
percent non hisp white 117.26 -30.46 – 264.97 0.117
percent non hisp black 222.99 -83.73 – 529.70 0.151
percent hisp white 439.56 -49.52 – 928.63 0.077
percentage high education -148.61 -224.74 – -72.49 <0.001
percent male -405.53 -920.48 – 109.43 0.120
Observations 62
R2 / R2 adjusted 0.329 / 0.242

Interpretation

The best model that was outputted by the stepwise regression is

cardiovascular disease hospitalization rate = 157.52 + 6.97(annual pm 2.5) + 1.73(median age) + 117.26 (percent non hisp white) + 222.99(percent non hisp black) + 439.56(percent hisp white) - 148.61.43(percent high education) - 405.53(percent male)

On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has zero percent of non-Hispanic-White, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-White, has zero percent of high education, has zero percent of male, the expected cardiovascular disease hospitalization rate is 157.52 case per 10,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.

  • On average, for every unit increase of annual pm 2.5 concentration, the expected cardiovascular disease hospitalization rate of the county increases by 6.97cases per 10,000 people. However, since the p-value is smaller than 0.05, we do not have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.

  • On average, for every unit increase of median age (1 year), the expected cardiovascular disease hospitalization rate of the county decreases by 1.73 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

  • On average, for every unit increase of percent of non-Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 177.26 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percent of non-Hispanic-White is significantly different than 0.

  • On average, for every unit increase of percent of non-Hispanic-Black, the expected cardiovascular disease hospitalization rate of the county increases by 222.99 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of non-Hispanic-Black is significantly different than 0.

  • On average, for every unit increase of percent of Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 439.56 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of Hispanic-White is significantly different than 0.

  • On average, for every unit increase of percent of higher education, the expected cardiovascular disease hospitalization rate of the county decreases by 148.61 cases per 10,000 people. Since the p-value is larger than 0.05, we have sufficient evidence to claim that the beta coefficient for percent of higher education is significantly different than 0.

  • On average, for every unit increase of percentage of male, the expected cardiovascular disease hospitalization rate of the county decreases by 405.53 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percentage male is significantly different than 0.

The adjusted R-squared of 0.242 implies that 24.2% of the variation in the response variable can be explained by its linear relationship with the set of the 7 predictors (annual pm 2.5 concentration, median age, percent non-Hispanic-White, percent non-Hispanic-Black, percent Hispanic-White, percentage of higher education, percentage of male).

Discussion: KEVIN

What were your findings? Are they what you expect? What insights into the data can you make?